For this project, I will attempt to classify handwritten digits using SVD's left-singular vectors as fundamental subspaces for each digit.
The approach will be to generate one distinctive subspace of orthogonal vectors - one U matrix for each class/digit - using Singular Value Decomposition. After creating these matrices, a classifier can be constructed to assign any new unknown image into a class (0, 1, 2, ... ,9).
In the first code block, I'm just importing some data for images. The dataset contains one column for each pixel (28x28 images, so 784 column vectors), plus one column for the label (as we can see in the front of the dataframe below).
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv("train.csv")
data.head()
Out[1]:
So that's what the dataset looks like, each pixel is a column, each row is an image in the dataset. Actually I have a corresponding 'test' dataset but I'll just separate this one into train/test later on since it's big enough.
All values are grayscale 0-255. Here are some images below. I just reshaped each image vector into 28x28 for showing.
In [2]:
print data.iloc[1][0]
plt.imshow(data.iloc[1][1:].reshape(28,28),cmap='Greys')
plt.show()
print data.iloc[28][0]
plt.imshow(data.iloc[28][1:].reshape(28,28),cmap='Greys')
plt.show()
In [3]:
np.unique(data['label'],return_counts=True)
Out[3]:
In the below bit of code, I made a function to separate the dataset into just 5 classes for demonstration purposes (only 0, 1, 2, 3, and 4). I also separated each of these digit datasets into a train and test dataset by assigning the first 80% of the dataset to train, and the rest for test. This is for each digit dataset.
I then compute the SVD for each train dataset so that I will have zero_u, one_u, ... , four_u. These will be the left singular vectors which I will use later for classification purposes.
In [14]:
def getTrainTest(digit):
digit_train = digit[0:int(len(digit)*.8)]
digit_train = digit_train[digit_train.columns[1:]].T
digit_test = digit[int(len(digit)*.8):]
digit_test = digit_test[digit_test.columns[1:]]
return (digit_train,digit_test)
zero = data[data['label']==0]
zero_train,zero_test = getTrainTest(zero)
one = data[data['label']==1]
one_train,one_test = getTrainTest(one)
two = data[data['label']==2]
two_train,two_test = getTrainTest(two)
three = data[data['label']==3]
three_train,three_test = getTrainTest(three)
four = data[data['label']==4]
four_train,four_test = getTrainTest(four)
zero_u,e,v = np.linalg.svd(zero_train,full_matrices=False)
one_u,e,v = np.linalg.svd(one_train,full_matrices=False)
two_u,e,v = np.linalg.svd(two_train,full_matrices=False)
three_u,e,v = np.linalg.svd(three_train,full_matrices=False)
four_u,e,v = np.linalg.svd(four_train,full_matrices=False)
#Regarding full_matrices = False
#If True, U and Vh are of shape (M,M), (N,N).
#If False, the shapes are (M,K) and (K,N), where K = min(M,N).
In [15]:
print zero_u.shape
print e.shape
print v.shape
Seeing the singular vectors reconstructed is pretty cool. Especially the '0' one. See below for '3' and '0'.
In [16]:
for i in range(4):
plt.imshow(three_u[:,i].reshape(28,28),cmap='Greys')#first 5 columns of U
plt.show()
In [17]:
for i in range(4):
plt.imshow(zero_u[:,i].reshape(28,28),cmap='Greys')#first 5 columns of U
plt.show()
Below I have written a classification function and I just went through each of the test datasets to classify them as a digit. I used the U matrices of each of the digits (zero_u, one_u, etc.) as the 'classifier' for each class, and I iterated through them to get their resulting output. The minimum of the output is the class which is assigned by this function.
The function utilizes the residual given by $\lVert z - U\alpha \lVert _2$, and to minimize this, we just have $\alpha = U^Tz$. The idea behind it is that the linear combination of singular vectors is most appropriate for the digit which it is constructed from, so an unknown '0' image would have the smallest residual with the '0' $U$ matrix residual.
In [18]:
def classifyUnknownDigit(newDigit):
classes = [zero_u,one_u,two_u,three_u,four_u]
values = []
for U in classes:
values.append(np.linalg.norm((np.identity(len(U))-np.matrix(U)*np.matrix(U.T)).dot(newDigit),ord=2)/np.linalg.norm(newDigit,ord=2))
return values.index(min(values))
zero_pred = []
one_pred = []
two_pred = []
three_pred = []
four_pred = []
for i in range(len(four_test)):
four_pred.append(classifyUnknownDigit(four_test.iloc[i]))
for i in range(len(zero_test)):
zero_pred.append(classifyUnknownDigit(zero_test.iloc[i]))
for i in range(len(two_test)):
two_pred.append(classifyUnknownDigit(two_test.iloc[i]))
for i in range(len(one_test)):
one_pred.append(classifyUnknownDigit(one_test.iloc[i]))
for i in range(len(three_test)):
three_pred.append(classifyUnknownDigit(three_test.iloc[i]))
print "Accuracy"
print "------------"
print "0: ", zero_pred.count(0)/1.0/len(zero_pred) #count the number of 0's, divide by length of list to get accuracy.
print "1: ", one_pred.count(1)/1.0/len(one_pred)
print "2: ", two_pred.count(2)/1.0/len(two_pred)
print "3: ", three_pred.count(3)/1.0/len(three_pred)
Woops forgot to print the 4 predictions. The above computation takes a few minutes and really beats up the CPU.
In [19]:
print "4: ", four_pred.count(4)/1.0/len(four_pred)
It's interesting to see how some digits are classified well like 1, 2, and 3, whereas the 0 and 4 are not as great. Still, 95% is pretty bad for real-world applications, especially for something as important as sending mail to the right people. The article talks about just letting humans classify the digits manually if it's not completely certain about a prediction (if there is no large distinction between values among the classifier matrices).
The function below prints the classes and number of times each class was selected for the zero_pred list. We can see that 0 was classified correctly 374 times, and misclassified a lot as a 3, then as a 2, and almost never as a 1 or 4. I guess the curves of the 2 and the 3 might have to do with this since 0's are curvey as well.
In [20]:
np.unique(zero_pred,return_counts=True)
Out[20]:
Similarly doing this for the four classifier, we see that it's pretty good but it's confusing 4's for 3's more than 1's,2's, and 0's. I guess the middle line of the '3' digit might be contributing to some confusion.
In [21]:
np.unique(four_pred,return_counts=True)
Out[21]:
In [ ]: